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Abstract 



We propose a new approach for clustering DNA features using array CGH data from 
Oh ■ multiple tumor samples. We distinguish data-collapsing: joining contiguous DNA clones 

I or probes with extremely similar data into regions, from clustering: joining contiguous, 

correlated regions based on a maximum likelihood principle. The model-based cluster- 
^ I ing algorithm accounts for the apparent spatial patterns in the data. We evaluate the 

c/3 i randomness of the clustering result by a cluster stability score in combination with cross- 

validation. Moreover, we argue that the clustering really captures spatial genomic de- 
^SJ . pendency by showing that coincidental clustering of independent regions is very unlikely. 

K*" ' Using the region and cluster information, we combine testing of these for association with 

. a clinical variable in an hierarchical multiple testing approach. This allows for interpret- 

ing the significance of both regions and clusters while controlling the Family- Wise Error 
jy-^ . Rate simultaneously. We prove that in the context of permutation tests and permutation- 

invariant clusters it is allowed to perform clustering and testing on the same data set. 
Our procedures are illustrated on two cancer data sets. 



Introduction 



^ ' Array Comparative Genomic Hybridization (array CGH) was designed as a liigh-resolution 

^ . measurement device for copy number aberrations, which are known to be involved in the 

cancer development process (Kallioniemi, 2008). Through hybridization, fluorescent labeled 
DNA is extracted from test and reference samples resulting in log-ratios of intensities of the 
two types of samples. When plotted against chromosomal position, the log-ratio data appears 
as segments at various amplitudes. One is particularly interested in those chromosomal 
segments which show levels of loss or gain, because those segments may possibly harbor 
oncogenic genes, cancer progression markers or other relevant features. Array CGH data is 
often encoded as loss (deletion), normal and gain by assigning discrete copy number states 
to chromosomal segments. Generally, two processing steps are performed for this purpose: 
segmentation and calling. 

Segmentation denotes a process to find breakpoints from single copy number profiles. 
Many segmentation algorithms have been proposed. Willenbrock and Fridlyand (2005); Lai 
et al. (2005) discuss and compare several of those, while Diaz-Uriarte and Rueda (2007) 
provide software to simultaneously apply several popular algorithms to the same data set. 
Recently, new algorithms have been introduced to deal with the computational burden caused 
by the increasing resolution of the arrays (f.e. Pique- Regi et al., 2008). After segmentation. 
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assigning discrete states to the segmented data is referred to as calling (Van de Wiel et al., 
2006). 

Here, we will assume that the data has been properly segmented and discretized. While 
the measurement unit is clearly defined from the array platform used, there is no a priori 
biologically most relevant unit of interest in DNA copy number experiments. This contrasts 
the majority of mRNA experiments, for which the genes as coding regions for proteins are 
a natural relevant unit. By nature, the crucial DNA copy number events, aberrations, arise 
when a piece of DNA is either deleted or gained. Such a piece can be an entire chromosomal 
arm, but also just 1/lO^th of an arm. Therefore, we aim to learn the relevant units from 
the data. Such units are particularly important for interpretation and power when testing 
association with clinical data. For simplifying terminology, we refer to the array CGH features 
as probes although this could also reflect clones or cDNAs. Then, our proposed two-step 
dimension reduction approach enables one to consider the data at two levels with decreasing 
resolution: regions and clusters. The regions are the result of collapsing neighboring probes of 
which the discretized values are highly repetitious and redundant within (almost) all profiles 
(Van de Wiel and Van Wieringen, 2007), while the clusters contain contiguous similarly 
behaving regions. 

After collapsing, our study proceeds in two directions. First, wc develop a model-based 
clustering algorithm that considers spatial patterns. Second, we perform simultaneous region- 
wise and cluster-wise multiple testing. The idea of multiple testing based on clustered results 
is not new. Benjamini and Heller (2007) adapt the False Discovery Rate (FDR) to allow 
for cluster-level multiple testing in the analysis of functional Magnetic Resonance Imaging 
(fMRI) data. Moreover, comparing to voxel-wise (feature level) multiple testing, they asserted 
that their method improved SNR (signal to noise ratio) and gained statistical power. Even 
though both array CGH data and fMRI data have strong spatial correlation patterns, we 
cannot apply the methodology of Benjamini and Heller (2007) directly. Using fMRI it is 
often feasible to obtain the clusters of voxels from independent, preparatory scans. In array 
CGH data, however, it is hardly possible to obtain such information a priori. One solution 
is splitting the set of samples in two parts, where one is used for clustering and the other 
for testing, which guarantees independence between clustering and testing. However, in that 
case, we should accept significant power loss due to splitting. Perone Pacifico et al. (2004) 
offer an alternative solution, also in the context of imaging data and FDR: clustering on the 
basis of p- values. Their theory is based on smooth Gaussian random fields, which is not a 
realistic model for array CGH data. Moreover, this would render meaningful clusters from a 
testing perspective only. 

We show that, in the context of permutation testing and family- wise error rate (FWER), 
it is possible to apply both clustering and testing to the same data set if the cluster result 
is permutation invariant. Wc consider selection of a suitable test statistic on both levels 
(regions and clusters) and use an hierarchical testing procedure to control the FWER. The 
entire clustering plus testing procedure is illustrated on two array CGH cancer data sets. 

High-dimensionality 

High-dimensionality is a common theme for the analysis of data produced by high-throughput 
genomic technologies. It denotes the situation that the number of features is much larger than 
the number of samples. Due to the large dimension, traditional multiple testing procedures 
such as Bonferroni yield very conservative results. 



While the term 'high-dimensionaUty' is used neutrally in terms of the number of features, 
some aTithors note that technological developments in data generating processes can allow a 
more specific approach for the data. Benjamini and Heller (2007) suggest that the assumption 
of infinitely many elements in a cluster is not unrealistic for fMRI data in an environment of 
improved technical resolution. A similar development has occurred for our type of genomic 
data. Here, the resolution increased from chromosome arms 40) to Bacterial Artificial 
Clones 3000), to oligos (< 400,000), to next generation arrays (> 1,000,000) (Park, 
2008). The increasing number of features does not necessarily imply an increasing number 
of 'distinct' relevant units, because these new features may appear as repetitions of similar 
copies on a coarser scale. 

We briefiy illustrate the above idea. Consider the extreme case in which a certain region 
of the genome spanning r probes never contains a genomic breakpoint for the type of samples 
under study. Doubling the resolution in this genomic region to 2r will only lead to measuring 
more of the same (due the absence of a breakpoint) and hence for both resolutions the relevant 
unit can be collapsed to one data point per sample for this genomic region. 

Discretized array CGH data usually appears as large numbers of distinct blocks which 
consist of probe vectors with (almost) the same discretized status (across samples) for as 
many as hundreds or thousands of probes. So, regions in the data correspond to distinct 
blocks. Those regions may be considered as biologically relevant units for the type of samples 
under study. 

Collapsing repetitious probes is useful, since it reduces unnecessary dimension of data and 
enhances computational convenience. Conventional clustering methods using correlational 
association among probes seem not appropriate in this stage, because these may yield mingled 
clusters: repetitiously copied probes mixed with similarly behaving probes. Hence this may 
weaken the performance and interpretation of the clustering methods usually performed in 
the next stage. From this point of view, it is desirable to separate two procedures: collapsing 
and clustering. The first step handles the technical resolution and is used for dimension 
reduction. The latter step is applied for combining similar behaving probe regions so that 
one may obtain hints of collective behaviors within certain genomic neighborhoods. 

We do not suggest a new algorithm for the collapsing process, but instead use the one 
proposed by Van de Wiel and Van Wieringen (2007). Here, the relative amount of information 
lost by collapsing can be controlled; we use 0.5% as an upper bound. Note that this algorithm 
maintains the high resolution where desired: small genomic segments that differ consequently 
from their neighbors are kept. In construction of the clustering models, we incorporate the 
collapsing information into our model in two ways: either via base-pair distance between two 
regions or via the number of collapsed probes between two regions. 

Methods 

Mathematical setting 

In this paper, the overall statistical model corresponds to n i.i.d. copies of a pair of random 
variables denoted by {X^,Y^), {X^,Y^), where, for each individual i = l,...,n, G 
{0,-1,1}"^ is the state vector along all the (ordered) regions and G {0)1} is a label 
determining which group the individual belongs to. In this model X = {X^, ...,X^) is a 
sample of i.i.d. variables, in which the distribution of X' can be seen as a mixture between 



the two groups = and = I. In particular, note that the distribution of the X^'s does 
not involve the relationship between the Y^'s and the X*'s. 

In a nutshell, our approach firstly clusters the rows of X that are similar from a model- 
based point of view and secondly detects the regions and clusters that are significantly asso- 
ciated to the label vector Y = {Y^, 

• The clustering phase uses the variables X, by assuming a specific parametric model for 
the distribution of the X^'s . Since X^,...,X^ are i.i.d., the clustering result A{X) is 
invariant under permutations of the columns of X. 

• The testing phase tests the independence between regions Xj''^ = (Xj)i<j<„ and Y 

and between clusters {Xj''^)j^A and Y, conditional on the clustering A{X). The (con- 
ditional) p-values are computed by performing permutations of the observed labels, 
which is valid because A(X) is permutation invariant. Then, these p-values are inte- 
grated in an hierarchical multiple testing procedure controlling the family-wise error 
rate both on the region and cluster level. 

Clustering model 

Our motivation for spatial clustering comes from Figure 1. In Figure 1, we illustrate Kendall 
correlations for pairs of all regions for two discretized array CGH data sets. Strong correlations 
are concentrated in the diagonal parts, so spatial dependency patterns are notably block- 
wise structures. Such correlations are caused by a high likelihood of the same aberration to 
span multiple consecutive regions (as opposed to off-diagonal correlations between regions on 
different chromosomes, which necessarily represent different aberrations). We incorporate the 
spatial correlation in our modeling approach. Log-linear models are a natural candidate for 
describing discrete data. Hence, wc apply such models for our purpose, while adapting these 
to take specific data characteristics such as spatial correlations, but also physical base pair 
distance between regions, into account. Moreover, we consider practical concerns for efficient 
optimization of the clustering. 

Let m be the number of regions and n be the number of i.i.d. samples under consideration. 
Moreover, let each region be indexed by its spatial location on the DNA, that is, the (j — l)-th 
and (j + l)-th regions are the neighbors of the j-th region. The goal is to identify clusters so 
that regions in one cluster behave similarly. 

We first consider the quadratic exponential model (Cox and Wermuth, 1994) which ap- 
proximates log-linear models for binary data up to 2nd order interactions. In our case we 
consider trinary valued variables. Because the samples arc initially assumed i.i.d., we suppress 
the sample index when describing the model, so Xj = xj refers to the observed state of the 
jth region. Hence, for observations xa = {xj)j£A, with Xj € {0, —1, 1} and ^4 C {1, . . . ,m}, 
the marginal distribution of the corresponding random variable Xa, Pa, is modeled as 



where f{xj,Xk) = —1, if xj ^ x^ and f{xj,Xk) = XjXk, otherwise. Note that is not a 
free parameter, because it is determined by the 'summing-to-one' condition. The discrete 
values '0', ' — 1' and '1' denote normal (= 2 copies), loss (< 2 copies) and gain (> 2 copies), 
respectively. We do not consider amplification (high copy number gain) and double deletion 
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Figure 1: Two correlation (Kendall's r) heatmaps for Chin et al. (2006) (left) and Douglas 
et al. (2004) (right) data sets. Data sets contain 383 and 436 regions, respectively. Regions 
are plotted according to chromosomal position. Colors represent correlations from -1 (cyan) 
to 1 (pink). 



(0 copies) as separate states because of computational complexity of the clustering algorithm, 
but alternative solutions to accommodate such different states are discussed in the Discussion. 

Since our purpose is not to specify every 2nd order interaction ^A,jk, but to bind features 
which are in the same cluster, it may be better to simplify the above model by assigning one 
average 2nd order interaction per cluster. Since in actual array CGH data we observe physical 
base-pair distance between two regions, we use this information in the model. We assume the 
dependence of two regions becomes weaker for more distant pairs of regions (similar to p. 430 
of Cressie (1993)). Hence the interaction parameter "jAjk is represented as 'jAdjk and 

J / ^~k / ^^^{d7^} if j and k are in the same chromosome, 

djk ~ 1 

[ otherwise. 

where djk is the base-pair distance between two regions j and k and the maximum is computed 
over pairs in the same chromosome. Alternatively, the distance function can be used to 
incorporate the collapsing information; for example, the distance between two regions j and 
k can be the actual number of probes (measurement unit) between them. Here, q is the 
decreasing rate of spatial dependence. Common choices for q are 0, 1,2 (Cressie, 1993). Let 
6i = {aAi, (3ai,i, ■ ■ ■ , PAe,me,lAi), where Al is a set of contiguous regions and is the number 
of regions in A£. Then, our full model for a partition A = {Ai}f^^ of {1, ... , m}, where Ka 
denotes the number of clusters, is 

Pa{x) = Y{pAe{xAt;Oe), (1) 

e=i 

with 

logpAe{xAe;6e) = aAe+^ PAejXj + jAe ^ djkf{xj,Xk). (2) 

jeAe j<:k,j,k€Ae 



Prom (1), we know that Xa^ and Xa^, are assumed independent for different i and t' . Denote 
the value of xa^ for sample i = 1, . . . , n by .t^^. Then, we search a partition and parameter 
setting that optimizes the following log-likelihood: 



{A,e)=wgm.&yi^^\ogpAi{x\\ei) (3) 



■^^^'^ i=i A.eA 



where C is the set of contiguous partitions and 9 = {0e)f^i- Note that the model (1) contains 
an intrinsic trade-off for dividing a cluster into two subclusters: potentially higher likelihood 
due to an extra parameter 7 (a is not a free parameter) with respect to one cluster versus 
potentially lower likelihood due to the forced independency for two regions in a different 
subcluster. Which effect prevails depends on how similar the regions between two subclusters 
are. As a consequence, the number of clusters i^_4 is not defined a priori, but instead computed 
through the optimization step. 

Generally, searching the space C for large m is computationally intractable. Therefore, we 
restrict the maximum size of a cluster. The components of the product in the right hand side 
of (3) are computed in two steps; first, we estimate the parameters of each cluster smaller 
than the maximum size. From these parameters, the likelihood of each candidate contiguous 
partition can be determined. Second, given the likelihoods, we search the partition space and, 
due to the specific partition structure (Gormen et al., 2000), we achieve the global optimum for 
the equation (3) by applying Dijkstra's shortest path algorithm. Note that, from a biological 
perspective, a cluster should not cover more than one chromosome, so we apply the algorithm 
per chromosome separately. 

One could question the usefulness of this cluster algorithm when the results would be 
sensitive to similar region-wise differences between, say, two clinical groups. However, observe 
from (2) that the likelihood for a given partition Ai is to a large extent determined by within- 
sample similarity. This means that, as opposed to a p-value type clustering method, our 
method is unlikely to cluster two independent consecutive regions that coincidentally have a 
similar group difference, because the independence implies that these two differences are likely 
to be caused by (partly) different samples. Moreover, in terms of causality, the perception 
is that the DNA copy number data (which are early onset markers) potentially affects the 
group labels, rather than the other way around. Nevertheless, in rare cases, two independent 
consecutive regions may have realizations that coincide for many samples, in particular when 
the marginal distributions within groups would (almost) degenerate (e.g. all O's in one group 
and all I's in the other). This may cause coincidental clustering. Therefore, we investigate 
how likely such coincidental clustering is for a given data set using a shuffling argument. 

Denote the event that a random pair of consecutive regions clusters by E, and the event 
that these two regions are dependent due to their genomic proximity by D. Its complement, 
lack of genomic dependency, is denoted by D'. If our cluster method performs properly 
P{D\E) should be very high. Applying Bayes' rule, we have 

P^Pin _ P{E\D)P{D) _ P{E\D)P{D) 
^ ' ' P{E) P{E\D)P{D) + P{E\D')P{D'y ^' 

Prom the data we easily estimate P{E) by counting consecutive pairs of regions that cluster. 
Then, if P{E\D') is very small and P{E) is fairly large, we necessarily have: P{E\D)P{D) » 
P{E\D')P{D'), which implies P{D\E) « 1. To show that P{E\D') is smaU, we break the 



genomic dependency structure by shuffling regions such that two regions of the same chromo- 
some are not allowed to be neighbors in the shuffled data set. Then, we apply our clustering 
method to the shuffled data set. If very few regions cluster, then, necessarily, P(E\D') is 
small. The shuffling is repeated several times to account for the arbitrariness of the shuffling. 



Clustering stability 

Since a clustering result depends on the data used, especially on the sample size, there exists 
uncertainty on the clustering result. We suggest to use the adjusted Rand index (Hubert 
and Arabie, 1985) to evaluate stability of the clustering results. If a small change of the 
data results in large change of the clustering result, then the clustering method is instable. 
Therefore, we check the stability using a cross validation framework. 

Suppose regions are numbered from 1 to m and a clustering result is represented as a 
contiguous partition of {!,... ,m} as in the previous section. Let A = {A£}f^^ and A' = 
{A'^i}^^'i be two such partitions which we aim to compare and let \A\ be the cardinality of 
index subset A. Then the adjusted Rand index (ARI) for the two partitions is defined as 

where r is the observed value of random variable R. The expectation of R is computed 
conditional on the fixed margins, \A(\, \A£i\,i = 1, . . . , Kj[, I' = I, . . . , K^i and the maximum 
is computed as the maximum of R regardless of the fixed margins. The adjusted Rand index 
takes numbers between —1 to 1. 

After obtaining V clustering results from F-fold cross validation, we compute the average 
adjusted Rand index, aARI, which serves as our stability measure: 

aARI =yYl ARI(^, v4(^) ) (5) 

v=l 

where A is the original clustering result from the full data and A^'"^ is the v-th cross- validated 
clustering result. 



Hierarchical multiple testing for clusters and regions 

Once we have the clusters A = {Ai}f^^, our focus turns to testing association of the region 
and cluster- wise data with clinical information. Let us first shortly consider association tests 

between groups (reference and test) and states (loss, normal and gain) for a 2 x 3 contingency 
table n = (nii,ni2,ni3, 7121,^122,^23) as Table 1. Let ngs be the population proportion for 



group \ state 


loss 


normal 


gain 




reference 


nil 


nu 


ni3 


ni+ 


test 


"-21 


n22 


n23 


"2+ 




n+i 


n+2 


n+3 


n++ 



Table 1: Contingency table representation of copy number aberrations between two groups. 



group g and state s. Then, for each region j we test 



Hoj : for any g,s, TTgsj = TfgjTrsj- 



(6) 



For a cluster A^, we simultaneously test 



(7) 



Besides the usual scala of test statistics for a particular testing problem, two classes of 
approaches may be distinguished: unconditional and conditional tests. In the first case, 
usually only the sample sizes (row margins) are fixed while column margins are variable. In 
the second case, both margins are fixed. Permutation tests fall in the latter category, since 
these only permute the labels (reference or test) for all samples. Permutation tests are useful 
to approximate (summaries of) multivariate distributions, which is exactly what we need for 
testing association on the level of clusters. Therefore, we use permutation tests in combination 
with the popular Pearson statistic for determining the significance of regions and clusters. 

A test statistic for testing (7) is = minj^AePj, where pj is the Rvalue obtained for 
region j in cluster A^. Using the p- values to define the cluster test statistic standardizes 
the regions within a cluster. Before further motivating the use of M^, we first introduce the 
hierarchical testing approach. 

The hierarchical multiple testing procedure we propose to use for the cluster-region data 
is based on that by Meinshausen (2008). Such a procedure allows to test both clusters and 
regions within one multiple testing framework. The procedure controls Family- Wise Error 
Rate (FWER) for hierarchical hypotheses. Why FWER and not the more popular False 
Discovery Rate (FDR)? Firstly, on the level of regions (usually a few hundreds) FDR may 
not refiect what it should, because a large 'cluster' of highly correlated regions could have a 
disproportional large share in the number of discoveries, which highly impacts the estimated 
FDR for other regions. This is a consequence of the fact that FDR does not provide control on 
subsets (Finner and Roters, 2001). Secondly, our constructed clusters are more independent 
than regions, but their number is much lower (often below 100). Then, the power enhancement 
of an FDR procedure w.r.t. FWER is usually quite subtle, which may not outbalance the 
stronger conclusion one is allowed to draw with FWER control. 

The procedure by Meinshausen (2008) assumes an a priori known clustering. Such a 
setting can be relevant for these data as well, but often it is preferable to use the same data 
for both clustering and testing (see Discussion). In a clustering plus permutation-testing 
framework, control of the FWER conditional to the data-based clustering is feasible when 
the clustering is permutation invariant: when testing Hq using permutations, we permute 
the clinical responses (group labels Y in our case), while keeping the regions Xj'"^ fixed for 
j E Af . If the result of the cluster algorithm is column-wise permutation invariant, the clusters 
may, in the testing phase, be assumed to be known when the null-hypotheses are formulated 
conditionally to the clusters. More mathematical details on the validity of permutation tests 
in this setting are presented in Appendix II. A similar argument for combining permutation 
testing with clustering is given by Goeman and Finos (2010). 

The cluster method introduced in this paper is permutation-invariant, because it does 
not use the group labels. We use the sequential hierarchical testing procedure proposed by 
Goeman and Finos (2010), with critical values that depend on the rejection set TZ, which 
contains both the rejected clusters and regions. This procedure is a slightly more powerful 



alternative to the one by Mcinshausen (2008). It applies the so-called inheritance principle 
in combination with the Shaffer (1986) improvement in an hierarchical testing context. The 
inheritance principle is a variation of the fall-back principle (Wiens and Dmitrienko, 2005) 
that allows to test an hypothesis less strictly (applying a larger p-value threshold) when 
neighboring hypotheses are rejected. Likewise, the Shaffer improvement allows use of a larger 
p-value threshold for regions in a cluster, because it uses the connection between the cluster 
and region hypotheses: if Hq is false, at least one Hqj should be false too for j € Ai (see 
Appendix II). We emphasize that this procedure guarantees strong control of FWER (Goeman 
and Finos, 2010). It relies only on the individual permutation p- values of clusters and regions. 
It is a Holm- type procedure and hence subset pivotality (Dudoit et al., 2003) is not required. 

For a given rejection set TZ (both for clusters in A and regions in {1, m}), critical values 
for clusters are defined by an = a/ {Kj\^ — D-ji), where D-ji equals the number of clusters in TZ 
for which all regions are members of 7?. as well. Note that we opt, as opposed to Meinshausen 
(2008), to weigh clusters equally, because for our application small clusters may be as relevant 
as large ones. This does not affect control of FWER (see Appendix III for a proof). Critical 
values for regions in cluster are denoted by a-jz^i = a-jz/dAi] — max(l, Dt^^^)), where -D7^/ 
equals the number of regions in cluster A^ that are members of TZ. 

Then, the hierarchical testing procedure, which is initiated by 7?. = 0, is as follows. 

1. Reject Hq for cluster Ag if Po(M^ < mg) < a-ji, where m£ is the realization of M^. 

2. If Hq is rejected, reject Hqj for region j in cluster Ag if pj < a-jz^g. 

3. Update the rejection set TZ and critical values a-jz^g. 

4. Repeat steps 2 and 3 for the non-rejected regions until no more regions are rejected. 

5. Update cluster critical values aji. 

6. Repeat steps 1 to 5 for the non-rejected clusters and regions therein. 

7. Stop when no hypothesis is rejected anymore. 

One could argue that for testing clusters Mg = miuj^AePj may have less power than 
a statistic that focuses more on 'average behavior' (such as a median p-value or a sum of 
standardized region- wise test statistics). This is true when small effects add up to one larger 
cluster- wise effect, which is quite common in mRNA gene expression studies. However, we 
believe the following scenarios to be more relevant for the clustered array CGH data: 1. a 
cluster is homogenous (large 7) , and hence high positive dependencies between regions within 
the cluster arc present: 2. a cluster is heterogenous, and for only a few regions the association 
exists. It is clear that in the first scenario little power is lost when using niinp with respect 
to f.e. medianp, while in the second scenario medianp has less power than minp. Using the 
hierarchical testing approach, the latter could, after rejection of Hq, still identify significant 
regions in the heterogenous cluster. 

Results 

We analyze two data sets from Chin et al. (2006) (Datal) and from Douglas et al. (2004) 
(Data2). Both data sets have been discretized to ternary values, 0,-1 and 1 (Van de Wiel 



et al., 2006). After collapsing, Datal contains 383 regions in rows and 96 and 49 samples 
in columns, which correspond to ER-positive and ER-negativc breast cancer samples, re- 
spectively. Data2 contains 436 regions in rows and 7 and 30 samples in columns involved 
in colorectal cancer, representing two group states: microsatellite instable and chromosomal 
instable, respectively. 

We first concentrate on the clustering results. For the clustering algorithm, the maximum 
number of regions per cluster is constrained to 9, which we generally found to be sufficiently 
large. Parameter q, the decreasing rate of spatial dependence is set to 1. Finally, 7 is re-scaled 
by 7 = {e^ — l)/{e' + 1) so that it lies between —1 to 1 as the nominal correlation coefficient 
(equation (6.5.10) in p. 430 of Cressie (1993)). 

Figures 2 and 3 illustrate the clustering results for Datal and Data2, respectively. The 
number of clusters is 63 and 69, respectively. The 10-fold cross-validation stability score, the 
adjusted average Rand index (5), is high for both data sets: 0.969 and 0.963, respectively. 
These indicate that the cluster results do not strongly depend on the in- or exclusion of 10% 
of the samples. 

For both data sets we also investigate the probability on coincidental clustering. More 
precisely, we show that for both data sets the probability that two (consecutive) regions in a 
cluster are really dependent is high. Following the argument outlined in the Methods section 
(see equation (4)) we first need to show that the probability that two consecutive regions 
cluster, P{E), is fairly large. Given the relatively small number of clusters in both data sets, 
this is the case. Next, we need to show that probability that two independent regions cluster, 
P{E\D'), is low by considering shuffled data sets. Indeed, using 25 shuffled data sets, we 
observe that for Data 1 (383 regions), on average 382.2 (range: 381-383) clusters are formed, 
while for Data 2 (436 regions) on average 434.5 (range: 432-436) clusters are formed. Hence, 
we are confident that clusters in the original data sets are almost always created because of 
genomic spatial dependency. 

For testing association of the class labels with the group labels, 20.000 permutations of 
the group labels are performed. The above hierarchical testing procedure was applied to 
the clusters and regions. Tables 2 and 3 contain the results of two significant clusters for 
both data sets, whereas Figures 4 and 5 combine the overall collapsing, clustering and testing 
results. Note that for both data sets the two groups are rather unbalanced. Hence, when a 
peak in the differential proportion of either gains or losses is caused by a large proportion of 
that aberration in the smaller group, these tend to be less significant (f.e. second cluster on 
chromosome 16, Figure 4). 

To elucidate the potential benefit of the clusters in a testing setting we compared the 
results of the hierarchical testing procedure with a simple Holm step-down procedure applied 
to the regions alone. Two FWER cut-offs were considered: a = 0.05 and a = 0.1. Our find- 
ings can be summarized as follows. For both data sets and both cut-offs the Holm procedure 
only identifies regions that are part of a cluster identified as significant by the hierarchical 
procedure. For Datal, a = 0.05, the hierarchical procedure identifies 8 clusters, two of which 
contain no regions that are identified by the Holm procedure. Holm identifies 13 regions, hi- 
erarchical two extra, 15. For Datal, a = 0.1, the hierarchical procedure identifies 9 clusters, 
for which each contains at least one region that is identified by the Holm procedure. Both 
procedures identify the same 18 regions. For Data2, a = 0.05, the hierarchical procedure 
identifies 1 significant cluster of which none of the regions are identified by Holm. No signifi- 
cant regions are identified by both. For Data2, a = 0.1, the hierarchical procedure identifies 
3 significant clusters. Two clusters contain not a single significant region according to Holm. 





Figure 2: Clustering results for Datal. Losses are plotted in red, normals in black, gains in 
green. Clusters are order according to chromosomal position from bottom to top and depicted 
alternately in yellow and blue. Sample labels are plotted on the bottom axis, "+" indicates 
an ER-positive sample, "-" an ER-negative one. 



Figure 3: Clustering results for Data2. See caption Figure 2. 



Both Holm and hierarchical identify (the same) one region. In summary, both procedures are 
comparable in terms of the number of identified regions, but the hierarchical procedure has 
a clear advantage: it is able to identify clusters of which none of the regions are identified 
by Holm. Meinshausen (2008) shows similar results using an hierarchical Bonferroni-type 
procedure for other data types. 

We also compared the hierarchical procedure with a Holm step-down procedure on clusters. 
As expected, the number of detections using these procedures differs very little, because 
clusters are the highest level in the hierarchy and the proportion of differential clusters is 
small. 

Discussion 

We introduced a conceptual idea for spatial high-dimensional data: dimension reduction in 
two steps, collapsing and clustering. These two steps are performed separately: fast collapsing 
on all array features and more rigorous, model-based clustering on the resulting regions, much 
fewer in number. 

Conditional versus unconditional null-hypotheses 

The examples presented here require only one data set to be available for the purpose of 
both clustering and testing. This implies that the null-hypotheses are conditional on the 
clusters. However, the methodology developed here also applies to the unconditional setting, 
where the inference is performed on samples independent of those used for clustering. First, 
note that since the clustering algorithm always separates chromosomes, the conditional and 
unconditional nulls are likely to be very accordant on the chromosome level. 




Figure 4: Clustering plus testing, Datal. Chromosomes on the bottom axis, separated by solid 
lines. Alternating light-grey and white bars demarcate clusters. Crosses show the association 
coefficients 7 (scale on left axis). Dark-grey bars: significant clusters, tick marks at the top 
axis: significant regions (a = 0.1). Bottom plot: absolute difference between proportions of 
gains in the two groups (green) and between proportions of losses (red) (scale on right axis). 
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Figure 5: Clustering plus testing, Data2. See caption Figure 4. 



Table 2: Comparison of p- values for two clusters in Datal. Columns 2-4 and 7-9 denote the 
raw (unadjusted), Holm and hierarchical p- values for clusters and regions, respectively. Fifth 
column contains the association coefficient 7. 
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Table 3: Comparison of p-values for two clusters in Data2. See caption Table 2. 
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In the conditional setting, wc emphasize that the FWER on clusters estimated from 
the same data is well-defined, albeit from conditional, and hence random, null hypotheses 
(because the clusters are random). This point of view was investigated by Perone Pacifico 
et al. (2004) in the context of a clustered version of FDR. However, as opposed to approach 
of Perone Pacifico et al. (2004) using p-value-based clustering, we argue that the conditional 
approach introduced here clearly delineates the information in the data which is used to make 
clusters from the information used to perform testing on the clusters, because the clusters 
are permutation invariant, so that we obtain an interpretable procedure on the permutation 
probability space as well. 

We acknowledge that the conditional setting is not fully in accordance with the traditional 
hypothesis testing setting, which is why one might prefer the unconditional setting in some 
cases. Reasons why we emphasize the conditional setting are the following. Tumor array CGH 
is still very expensive, and, more importantly, for many clinical studies it is not practically 
feasible to obtain good quality tumor material for a large amount of samples. Moreover, cancer 
is a very heterogeneous disease, and hence the power to detect differences between groups for 
particular genomic locations is likely to be small for small sample sizes. For example, some 
aberrations occur in only 5% of the entire population, but it may still be relevant to detect 
such an aberration when (most of) it 5% would belong to one group. Therefore, one often 
prefers to use as many samples as possible in the testing phase, and not 'sacrifice' samples for 
the purpose of clustering. However, the unconditional setting may be particularly attractive 
when aCGH data {X) is externally available, but the response (Y) is not. In such a case, the 
external data can anyhow not be used in the testing phase. Our software easily applies to 
such a setting as well. 

Finally, we studied to what extent the results of the conditional and unconditional ap- 
proaches differ for Datal. We repeatedly split this data set in two parts. The conditional 
approach uses the second part for both clustering and testing, whereas the unconditional 
approach uses the first part for clustering and the second for testing. We registered which 
regions are significant and which regions are a member of a significant cluster, at a = 0.2. 
Summarizing, on average 94.3% (99.9%) of the regions rejected (non-rejected) by the con- 
ditional approach were confirmed by the unconditional approach, whereas on average 93.2% 
(99.8%) of the regions that are member of a rejected (non-rejected) cluster as determined by 
the conditional approach were confirmed by the unconditional approach. Hence, it is com- 
fortable to notice that the stability of the clustering implies a high agreement between the 
unconditional and conditional approach, whereas the latter uses only half of the data. 

Other issues and conclusion 

A byproduct of our clustering method is the strength measure of association (7) within a 
cluster. This helps in interpreting the results of the hierarchical multiple testing on clusters 
and regions: if 7 is relatively small for a significant cluster, one expects that the significance 
is driven by a few regions within the cluster, which should be reflected in the adjusted region- 
wise p- values. For large values of 7 one expects rather similar adjusted region- wise p- values. 
Hence, to some extent 7 and the region-wise p-values may help in distinguishing genomic 
regions that are potentially causally related to the response and those that are just correlated 
with neighboring, more strongly associated regions. In any case, further biological validation, 
also at other molecular levels, such as mRNA or protein, is needed to decide which genomic 
DNA regions are really causally related to the response. 



Some prefer the use of undiscretized rather than discretized array CGH data for testing 
association with cHnical information. Wc beheve our cluster-testing approach to be useful 
in this setting as well. The clustering would then group features that share the underlying 
discrete characteristics of the data, while the corresponding undiscretized data would be used 
to achieve (supposedly) more power in the permutation testing procedure. 

For clusters that contain regions that are amplified (high copy number gain) rather than 
gained, one could apply our algorithm on a 4-digit code (-1,0,1,2) but this may be very time- 
consuming. When few losses (-1) are present in such a cluster (which is not unlikely given 
the presence of amplifications), one may locally re-define the 4-digit code to a 3-digit one 
where the losses and normals would be joined in one class. Another alternative solution is 
to keep the 4-digit code, but instead of using an extra parameter in model (1) use the same 
parameter as for = 1 (gains), but now with Xi = 2. This would give more weight to two 
consecutively amplified regions than for two consecutively gained regions. Double deletions 
may be dealt with in a similar way. 

Our clustering algorithm suggests a development for DNA probe design in low dimen- 
sional platforms such as MLPA (Schouten et al., 2002). If strongly associated clusters could 
be verified in further studies, then one may consider to use only one MLPA probe per cluster. 
Such low dimensional platforms arc usually less costly, more reproducible and more straight- 
forward to analyze. Another application of our genomic clusters is the clustering of samples 
and prediction of clinical outcome. Van Wieringen et al. (2008) show that use of genomic 
regions rather than individual probes to cluster samples based on array CGH data enhances 
clustering performance. In prediction, the smaller number of clusters may result in simpler 
and more robust feature selection. 

In summary, we developed a dedicated clustering algorithm which, in combination with 
permutation testing, allows a multiple resolution perspective on association of array CGH 
data with clinical information. With the introduction of extremely high-resolution data, f.e. 
obtained by massive parallel sequencing, the need for such methods will only increase. 
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Appendix I: Software 

The methods discussed here, both for clustering and hierarchical testing, are implemented 
in the R-package dnaCplusT. The package contains example data sets, documentation on 
functions and parameters and the actual R-code. For the implementation of the procedures, 
we used R-Bioconductor software packages RBGL, partitions and multtest. The package is 
available from the last author's web site: http:/ /www. few. vu.nl/~mavdwiel/dnaCplusT. html. 



Appendix II: Validity of permutation test 



Here, we prove the validity of the permutation tests in our setting. In this section, let us 
write Xj = Xj'"' for short. Under the null 

Hqj: Y and Xj are independent conditionally on A{X) = A, 

the following proves that the distribution of {Y,Xj) is equal to the distribution of (V^^Xj), 
conditionally on A{X) = A, for any deterministic permutation a of {l,...,n} (where the 
superscript "a" codes for a cj-column-wise permutation of the vector or of the matrix): 

P{Y = y'',X,=x,\A{X) = A) 

= P(y = y^lAiX) = A)P{Xj = Xj\A{X) = A) 
= P(y^ = y^lAiX") = A)P{Xj = Xj\A{X) = A) 
= P(y^ = y^lAiX) = A)PiXj = Xj\A{X) = A) 
= P(y = y\A{X) = A)P{Xj = Xj\A{X) = A) 
= P{Y = y,Xj = xj\A{X)=A). 

Similarly, we may prove that under the null 

Hq-. Y and {Xj)j^Ae are independent conditionally on A{X) = A, (8) 

the distribution of (Y, {Xj)j,zA^) is equal to the distribution of (Y'^, [Xj)j^A^), conditionally 
on A{X) = A, for any deterministic permutation a of {1, ...,n}. 

Null-hypothesis (8) is sufficient for applying permutation and allows for FWER control 

using the inheritance principle. However, in order the sharpen FWER control using Shaffer's 
(1986) improvement, we need to assume that (8) is implied by 

because the intersection hypothesis contains the logical relationship between the region-wise 
null- hypotheses and the cluster hypothesis needed to apply this improvement. This assump- 
tion means that the dependency between Y and the vector {Xj)j^Ai is fully described by 
dependencies between Y and Xj^s, j G A£. So for example (Y _L Xi) A (y _L X2) ^ Y ± 
(Xi, X2). Note the similarity between this assumption and one often made in step- wise regres- 
sion (with random covariates), where an interaction term X1X2 (which models dependency 
between Y and {Xi,X2)) is only considered once at least one of the main terms (Xi or X2) 
is present in the model. 

Appendix III: Proof of FWER control 

Here, we provide a proof that the hierarchical multiple testing TZ controls the FWER both 
on the cluster and on the region levels. We consider the case where we include the Shaffer 
correction (the other case is similar). The proof was provided in Goeman and Finos (2010), 
but for a different type of hypothesis weighting. Our arguments are very similar. 



The proof uses the so-cahed "sequential rejection principle, presented by Goeman and 
Solari (2009) (see also Arlot et al. (2010)). Our hierarchical rejection procedure TZ, rejecting 
both clusters A ^ A and regions j G A, can be expressed as the following sequential rejective 
procedure: TZ = TZk where TZq = ^ and for all i>0, 

7^^+l =niU{Ae A\PA < a7^J U (J {j e A\pj < a7^,,A}, 

AeA 

and where k is the first i > for which T^^+i = T^j. In the above recursion relation, pA and 
Pj are the p- values for cluster A and region j, respectively, while the threshold on clusters 
is a-ji = a/{KA. — \Etz.\), with E-ji = {A G 7^ fl .4|^ C Tt}, and the threshold on regions 
is a-ji^A = a7^1{^ G ^ n 7^}/(|^| — max(l, \ A fl TZ])). In the latter thresholds, we use the 
convention 1/0 = +oo and 0/0 = 0. 

We aim to establish that our procedure controls the hierarchical FWER at a pre-specified 
level a. Prom the sequential rejection principle, the latter is true, as soon as both a "mono- 
tonicity condition" and a "single step condition" hold (Goeman and Solari, 2009). The first 
condition is satisfied because a-jz and a-jz^A are nondecreasing in TZ. The second condition is 
satisfied if we have 

AeAnHo AeAjeAnno 

where we denoted by Hq the set of the true clusters and regions and by its complementary. 
We now prove (9) : we have 



AeAjeAnHo AeAnu^ ' ' ^ " "'^ 

Now, the sum appearing in the right hand side of the above relation can be taken only over 
A e ADHq such that |^ n "Hoi 7^ 0, that is over A e An 'Ho\E-hc. Moreover, we may use 

the logical relation between the hierarchical hypotheses saying that if a cluster is false then 
at least one of its regions is false: A e ACiHq implies \A n T-Lo] < \A\ — 1. Combining these 
two facts, we obtain 



AeAjeAn-Ho AnH^\E-n 



\Anno\ 



\A\ -max(l,|An?^g| 







<aws Y 



min(|An-Hol, 1^1 - 1) 



\A\-max{l,\An'H^\) 
= an^^{\An'H',\-\Enc\) 

Thus, we have 



Y "^n + Y Y '^n'A<an^,{\Anno\ + \Ann'o\-\Enc\) 

AeAnHo AeAjeAnHo 

= olhi,{Ka - l^^^^gl) = a, 



which proves (9) and the required FWER control. 
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